EDR DIRECTION ESTIMATING METHOD, SYSTEM, AND PROGRAM, 
AND MEMORY MEDIUM FOR STORING THE PROGRAM 



BACKGROUND OF THE INVENTION: 

The present invention relates to a method and system for 
estimating EDR directions in a single-index model, and more particularly to a 
method, system, and program for estimating EDR directions in a single-index 
model related to a large number of variables, and a memory medium for 
storing the program. 

In general, one of the objects of statistical analysis of actual
phenomena is to find relationships among various characteristics and make
a prediction. In such a case, it is common practice to find a relationship
from data using regression analysis and make a prediction on certain
variables. For example, linear regression analysis or logistic regression
analysis is used to analyze the relationship between a response variable y
and an explanatory variable x.

However, the higher the dimension p of the explanatory variable x, 
the more difficult it is to perform this type of regression analysis. To solve 
this problem, there have been proposed several methods to reduce the 
number of dimensions of the explanatory variable x. 

For example, referring to the following document 1 (Ker-Chau Li, 
"Sliced inverse regression for dimension reduction," Journal of the American 
Statistical Association, Vol. 86 (414), pp. 316-342, 1991.), Ker-Chau Li 
proposed SIR (Sliced Inverse Regression). 

SIR is a method for determining a subspace of x sufficient to
describe the response variable y. The subspace so determined is called the
EDR (Effective Dimension Reduction) space, and a vector spanning the EDR
space is called an EDR direction vector. Using conventional regression
analysis, the relationship between the response variable y and the
explanatory variable x can then be found in the EDR space, the dimension
of which has been reduced.

Referring also to the following document 2 (Ichimura et al.,
"Optimal Smoothing in Single Index Models," The Annals of Statistics, Vol.
21, pp. 157-178, 1993.), Hall and Ichimura estimated EDR directions using a
smoothing method.

Referring further to the following document 3 (Xia et al., "An 
adaptive estimation of dimension reduction space," Journal of the Royal 
Statistical Society (Series B), Vol. 64, pp. 363-410, 2002.), Xia et al. 
proposed a technique for estimating the EDR space using a non-linear 
smoothing method. However, if the number of explanatory variables 
becomes enormous, it will be very difficult to make calculations. 

SIR will be described below. In the SIR method, a model
indicated by the following equations (1) to (6) is assumed.

$$ y = f(\beta_1' x, \ldots, \beta_K' x, \varepsilon) \qquad (1) $$

In this equation, y represents a response variable, f is an
unknown function, $\varepsilon$ is a random variable independent of x, and x is a
p-dimensional explanatory variable. Further, $\beta_1, \ldots, \beta_K$ are p-dimensional
unknown coefficient vectors, that is, EDR direction vectors.

Using Figs. 1 and 2, SIR operations will be described below. First,
explanatory variables in a data file inputted from an input device 1 are
standardized by data standardizing means 24 of a data analyzer 2 (step A1
in Fig. 2):

$$ z_i = \Sigma_{xx}^{-1/2} (x_i - \bar{x}) \quad (i = 1, \ldots, n) \qquad (2) $$

where $\Sigma_{xx}$ and $\bar{x}$ are the variance-covariance matrix and the average of x,
respectively.

Then slice average calculating means 22 sorts the response variables
$y_i$ and divides them into H slices $I_1, \ldots, I_H$ (step A2). Then the proportion
of response variables belonging to slice $I_k$ is calculated as $p_k$ (see the
following equation (3)):

$$ p_k = \frac{1}{n} \sum_{i=1}^{n} \delta_k(y_i) \qquad (3) $$

where $\delta_k(y_i)$ is 1 if $y_i \in I_k$ and 0 otherwise.

Next, using the following equation (4), the mean vector of
standardized explanatory variables is calculated for each slice (step A3):

$$ m_k = \frac{1}{n p_k} \sum_{y_i \in I_k} z_i \qquad (4) $$

Then, principal component analyzing means 25 carries out a
principal component analysis of the mean vectors $m_k$ on a slice basis to
determine eigenvectors (step A4).

In this case, the characteristic numbers (eigenvalues) and eigenvectors are
those of the matrix V given by the following equation (5):

$$ V = \sum_{k=1}^{H} p_k m_k m_k' \qquad (5) $$

The data standardizing means 24 extracts K eigenvectors $\eta_k$ (k =
1, ..., K) in descending order of their characteristic numbers, and uses
the following equation (6) to transform them into the original coordinate
system (step A5):

$$ \beta_k = \Sigma_{xx}^{-1/2} \eta_k \qquad (6) $$
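Steps A1 to A5 of the SIR procedure described above can be sketched in Python with NumPy as follows. This is only an illustrative sketch under stated assumptions, not code from the cited documents: the function name `sir_directions`, the equal-count slicing with `np.array_split`, and the use of `np.cov` (which divides by n - 1) are choices made for the example.

```python
import numpy as np

def sir_directions(X, y, n_slices=5, n_directions=1):
    """Illustrative sketch of SIR steps A1-A5 (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    # Step A1: z_i = Sigma_xx^{-1/2} (x_i - x_bar), equation (2).
    x_bar = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(sigma)
    # Not finite when sigma is singular (e.g. more variables than samples).
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - x_bar) @ inv_sqrt

    # Step A2: sort the responses and split the indices into H slices.
    slices = np.array_split(np.argsort(y), n_slices)

    # Steps A3-A4: slice proportions p_k, slice means m_k, matrix V of (5).
    V = np.zeros((p, p))
    for idx in slices:
        p_k = len(idx) / n
        m_k = Z[idx].mean(axis=0)
        V += p_k * np.outer(m_k, m_k)

    # Step A5: leading eigenvectors eta_k, mapped back by equation (6).
    w, v = np.linalg.eigh(V)
    order = np.argsort(w)[::-1][:n_directions]
    beta = inv_sqrt @ v[:, order]
    return beta / np.linalg.norm(beta, axis=0)
```

The sketch needs both the inverse square root of a p x p matrix and an eigendecomposition of a p x p matrix, which are the two operations that become costly or impossible for a very large number of variables.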

The EDR direction vectors determined at step A5 are outputted on
an output device 3 (step A6).

The first problem of the above-mentioned prior art is that SIR is
not applicable to data having a large number of variables, such as data from
a DNA chip for gene expression analysis or a microarray. In order to
standardize data, SIR requires the inverse matrix of the variance-covariance
matrix of explanatory variables, and a principal component analysis to
determine the eigenvectors used for estimating EDR direction vectors.
However, if the variables are enormous in number, it may be mathematically
impossible to determine the inverse matrix of the variance-covariance matrix,
or the principal component analysis may take enormous computation time.

The second problem is that SIR limits the distribution of
explanatory variables to elliptic distributions. Therefore, SIR cannot be
applied when explanatory variables are binary.
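The first problem can be illustrated numerically. The sketch below uses synthetic data and sizes chosen purely for illustration (n = 20 samples, p = 100 variables are assumptions, not figures from this document) and checks that the sample variance-covariance matrix is rank deficient, so that its inverse does not exist.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                       # far more variables than samples
X = rng.normal(size=(n, p))

sigma = np.cov(X, rowvar=False)      # p x p sample variance-covariance matrix
rank = np.linalg.matrix_rank(sigma)

# The sample covariance of n observations has rank at most n - 1, so when
# p > n - 1 it is singular and cannot be inverted for standardization.
print(sigma.shape, rank)
```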

SUMMARY OF THE INVENTION: 

It is an object of the present invention to provide a method and
system which estimate EDR directions with simple calculations, without
using the inverse matrix of the variance-covariance matrix or principal
component analysis, when the number of slices is two in a single index model
represented by the equation below. The single index model is a
model which contains one unknown coefficient vector and includes
conventional multiple linear regression analysis and logistic regression
analysis.

The single index model can be represented by the following
equation (7):

$$ y = f(\beta_0' x, \varepsilon) \qquad (7) $$

where y is a response variable, f is an unknown, comprehensive, monotone
function, $\varepsilon$ is a random variable independent of x, and x is a p-dimensional
explanatory variable. Further, $\beta_0$ is a p-dimensional unknown coefficient
vector, that is, a true EDR direction vector.



It is another object of the present invention not to assume any 
particular form of distributions of explanatory variables x so that the EDR 
direction estimating system of the present invention can be applied even 
when the explanatory variables are binary. 

It is still another object of the present invention to provide a
technique and system for searching for important genes based on data having
a large number of variables, such as data from a DNA chip for gene
expression analysis or a microarray.

An EDR direction estimating system according to the present 
invention includes an input device for inputting a data file to be analyzed, a 
data analyzer operated under program control, and an output device. In this 
system, the data analyzer includes 

data conversion means, which receives data to be analyzed, the 
data composed of sets of response variables and explanatory variables, 
standardizes the explanatory variables, and outputs data composed of sets 
of standardized explanatory variables and response variables, 

slice average calculating means, which takes in the data composed
of the sets of standardized explanatory variables and response variables, 
divides the data into two slices with reference to a predetermined threshold 
for the response variables, calculates the mean vector of the standardized 
explanatory variables on a slice basis, and outputs the mean vector for each 
slice, and 

EDR direction calculating means, which takes in the mean vector 
for each slice, calculates the difference between the two mean vectors to 
determine an EDR direction, and outputs the EDR direction data to the data 
conversion means, such that 

the data conversion means converts the EDR direction data to a 
unit vector and outputs the unit vector to the output device as an estimated 
value for the EDR direction. 



An EDR direction estimating method according to the present 
invention includes the steps of: 

inputting a data file to be analyzed; 

receiving data to be analyzed, the data composed of sets of 
response variables and explanatory variables, standardizing the explanatory 
variables, and outputting data composed of sets of standardized explanatory 
variables and response variables; 

receiving the data composed of the sets of standardized 
explanatory variables and response variables, dividing the data into two 
slices with reference to a predetermined threshold for the response variables, 
calculating the mean vector of the standardized explanatory variables on a 
slice basis, and outputting the mean vector for each slice; 

receiving the mean vector for each slice, calculating the difference 
between the two mean vectors to determine an EDR direction, and 
outputting the EDR direction data to the data conversion means; and 

converting the EDR direction data to a unit vector and outputting 
the unit vector as an estimated value for the EDR direction. 

BRIEF DESCRIPTION OF THE DRAWING 

Fig. 1 is a block diagram showing a prior art structure. 

Fig. 2 is a flowchart showing the operation of the prior art. 

Fig. 3 is a block diagram showing the structure according to a first 
embodiment of the present invention. 

Fig. 4 is a flowchart showing the operation of the first embodiment 
of the present invention. 

Fig. 5 is a block diagram showing the structure according to a fifth 
embodiment of the present invention. 

Fig. 6 is a scatter plot showing data created by a model.

Fig. 7 is a scatter plot of $z^{(1)}$ and $z^{(2)}$.



Fig. 8 is a scatter plot of response variables versus estimated 
EDR directions. 

Fig. 9 is a scatter plot of response variables versus EDR 
directions corrected by a correlation matrix. 

DESCRIPTION OF THE PREFERRED EMBODIMENT 

A first embodiment of the present invention will now be described 
with reference to the accompanying drawings. Referring to Fig. 3, an EDR 
direction estimating system according to the first embodiment of the present 
invention includes an input device 1 for inputting a data file to be analyzed, a 
data analyzer 2 operated under program control, and an output device 3 
such as a display device and/or printer. The data file to be analyzed is
composed of N sets of data, each set consisting of one response variable
and one p-dimensional explanatory variable (covariate). The data analyzer 2
includes data conversion means 21, slice average calculating means 22,
and EDR direction calculating means 23.

The data conversion means 21 standardizes the N p-dimensional
covariates in the data file given, and sends data composed of sets of
standardized covariates and response variables to the slice average
calculating means 22. The data conversion means 21 also transforms the EDR
direction and the corrected EDR direction given by the EDR direction
calculating means 23 into the original coordinate system, further converts
them to unit vectors, and outputs them to the output device 3.

The slice average calculating means 22 divides the N sets of data 
into two slices with reference to the median of the response variables. The 
slice average calculating means 22 further calculates the mean vector of the 
p-dimensional covariates in each slice, and sends them to the EDR direction 
calculating means 23. 

The EDR direction calculating means 23 determines the
difference between the two mean vectors given by the slice average
calculating means 22. An EDR direction is determined from this calculation.
The EDR direction calculating means 23 further determines the correlation
matrix of the p-dimensional covariates. Then, if it can calculate the inverse
matrix of the correlation matrix, the EDR direction calculating means 23
corrects the EDR direction using the inverse matrix of the correlation matrix,
and sends both the EDR direction and the corrected EDR direction to the data
conversion means 21. On the other hand, if it cannot calculate the inverse
matrix of the correlation matrix, the EDR direction calculating means 23
sends only the EDR direction to the data conversion means 21.

Referring next to Figs. 3 and 4, the operation of the embodiment
will be described in detail. It is assumed that the data in the data file to be
analyzed are represented by the following equation (8):

$$ (y_i, x_i), \quad i = 1, \ldots, N \qquad (8) $$

where $y_i$ is a response variable and $x_i$ is a p-dimensional covariate. The data
to be analyzed are sent to the data conversion means 21. The data
conversion means 21 standardizes the covariates $x_i^{(j)}$ as represented in the
following equation (9) using a sample average $\hat{\mu}^{(j)}$ of the covariates and a
variance $(\hat{\sigma}^{(j)})^2$:

$$ z_i^{(j)} = \frac{x_i^{(j)} - \hat{\mu}^{(j)}}{\hat{\sigma}^{(j)}} \qquad (9) $$

It is assumed in this equation that $x_i = (x_i^{(1)}, \ldots, x_i^{(p)})$, and the
sample average $\hat{\mu}^{(j)}$ and the variance $(\hat{\sigma}^{(j)})^2$ are given by the following
equations (10) and (11) respectively (step A1 in Fig. 4):

$$ \hat{\mu}^{(j)} = \frac{1}{N} \sum_{i=1}^{N} x_i^{(j)} \qquad (10) $$

$$ (\hat{\sigma}^{(j)})^2 = \frac{1}{N} \sum_{i=1}^{N} \left( x_i^{(j)} - \hat{\mu}^{(j)} \right)^2 \qquad (11) $$

The slice average calculating means 22 divides the response variables $y_i$
in the data to be analyzed into two slices $I_H$ and $I_L$ according to the
following equation (12):

$$ I_H = \{ i \mid y_i > t,\ i \in I \}, \quad I_L = \{ i \mid y_i \le t,\ i \in I \} \qquad (12) $$

where the threshold t takes the median of y and $I = \{1, \ldots, N\}$ (step A2).

Then, the mean vectors $m_H$, $m_L$ of the standardized
covariates $z_i$ are calculated for respective slices $I_H$ and $I_L$ according to the
following equation (13):

$$ m_H = \frac{1}{N_H} \sum_{i \in I_H} z_i, \quad m_L = \frac{1}{N_L} \sum_{i \in I_L} z_i \qquad (13) $$

In this equation, $N_H$ represents the number of data belonging to $I_H$,
$N_L = N - N_H$, and $z_i = (z_i^{(1)}, \ldots, z_i^{(p)})$ (step A3).

Then, according to the following equation (14), the EDR direction
calculating means 23 calculates the difference between the mean vectors
determined at step A3 (step A4):

$$ \hat{\eta} = \frac{1}{2} (m_H - m_L) \qquad (14) $$
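Steps A1 to A4 above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `two_slice_direction` is hypothetical, `np.std` divides by N, and the difference of the mean vectors is scaled by 1/2. Neither the divisor nor the scale factor affects the direction of the result, only its length.

```python
import numpy as np

def two_slice_direction(X, y, threshold=None):
    """Illustrative sketch of steps A1-A4 (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Step A1: standardize each covariate separately; no matrix inverse.
    mu = X.mean(axis=0)
    sd = X.std(axis=0)               # ASSUMPTION: divisor N (ddof=0)
    Z = (X - mu) / sd

    # Step A2: two slices split at the threshold (median by default).
    t = np.median(y) if threshold is None else threshold
    high = y > t
    low = ~high

    # Step A3: mean vector of the standardized covariates in each slice.
    m_high = Z[high].mean(axis=0)
    m_low = Z[low].mean(axis=0)

    # Step A4: the (scaled) difference of the two mean vectors.
    eta = 0.5 * (m_high - m_low)     # ASSUMPTION: scale factor 1/2
    return eta, sd
```

The cost is linear in both the number of samples and the number of variables, since only column means, column standard deviations, and two slice means are computed.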

Next, at step A5, the correlation matrix $\hat{\Omega}$ of the covariates is
calculated.

Then, if it can determine the inverse matrix of the correlation matrix
$\hat{\Omega}$ at step A6, the EDR direction calculating means 23 will use the inverse
matrix to correct $\hat{\eta}$ according to the following equation (15) (step A7):

$$ \hat{\eta}_N = \hat{\Omega}^{-1} \hat{\eta} \qquad (15) $$

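On the other hand, if it cannot determine the inverse matrix of the
correlation matrix $\hat{\Omega}$, the procedure goes to step A8. The data conversion
means 21 transforms the determined $\hat{\eta}$ and $\hat{\eta}_N$ into the original coordinate
system, and standardizes them into unit vectors according to the following
equation (16) (step A8):

$$ \hat{\eta} \leftarrow \frac{\hat{\Sigma}^{-1/2} \hat{\eta}}{\| \hat{\Sigma}^{-1/2} \hat{\eta} \|}, \quad \hat{\eta}_N \leftarrow \frac{\hat{\Sigma}^{-1/2} \hat{\eta}_N}{\| \hat{\Sigma}^{-1/2} \hat{\eta}_N \|} \qquad (16) $$

where $\hat{\Sigma} = \mathrm{diag}\{(\hat{\sigma}^{(1)})^2, \ldots, (\hat{\sigma}^{(p)})^2\}$ and $\hat{\Sigma}^{-1/2} = \mathrm{diag}\{1/\hat{\sigma}^{(1)}, \ldots, 1/\hat{\sigma}^{(p)}\}$.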

The determined vectors are outputted on the output device 3 as 
estimated values for EDR directions. 
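Steps A5 to A8 can be sketched as follows. This is a hedged illustration, not the patented implementation: the function name `correct_and_rescale` is hypothetical, and the invertibility test at step A6 is assumed here to be a full-rank check with `np.linalg.matrix_rank`, since the text does not say how invertibility is decided.

```python
import numpy as np

def correct_and_rescale(eta, X):
    """Illustrative sketch of steps A5-A8 (hypothetical helper)."""
    eta = np.asarray(eta, dtype=float)
    X = np.asarray(X, dtype=float)
    sd = X.std(axis=0)

    def to_unit_original(v):
        # Step A8: rescale by 1 / sd per covariate, then normalize to length 1.
        w = v / sd
        return w / np.linalg.norm(w)

    # Step A5: correlation matrix of the covariates.
    omega = np.corrcoef(X, rowvar=False)

    # Step A6: ASSUMPTION - "invertible" is taken to mean full numerical rank.
    corrected = None
    if np.linalg.matrix_rank(omega) == omega.shape[0]:
        # Step A7: solve omega @ eta_n = eta instead of forming the inverse.
        eta_n = np.linalg.solve(omega, eta)
        corrected = to_unit_original(eta_n)

    return to_unit_original(eta), corrected
```

When the correlation matrix is not invertible, the sketch returns `None` for the corrected direction, mirroring the case in which only the uncorrected EDR direction is output.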

The output device 3 displays or prints out a graph showing plots
of response variables versus mappings (scores) $\hat{\eta}' x$ and $\hat{\eta}_N' x$ of the
covariates x in the EDR directions $\hat{\eta}$ and $\hat{\eta}_N$.

The effects of the embodiment will next be described. In the
embodiment, the EDR directions can be estimated without principal
component analysis, so that complicated matrix calculations need not be
performed, thereby saving a lot of calculation time. Further, only the mean
vectors and the difference between the mean vectors have to be
calculated, so that EDR directions can be estimated for data having a large
number of variables, to which SIR is not applicable.

A second embodiment of the present invention will next be
described. In the second embodiment, a mean value is used as the
threshold t for the division into slices. The structure of the second
embodiment is the same as that of the first embodiment. A different point is
that, while the median is used as the threshold t for the division into slices in
the operation of the first embodiment, a mean value is used as the threshold
t in the operation of the second embodiment.

The effect of this embodiment will be described below. When the
distribution of response variables y is skewed for both large values and small
values, the use of the median for the division into slices in the first
embodiment may not be able to divide both the skewed distributions properly.
On the other hand, since the mean value is used for the division into slices in
the second embodiment, both the skewed distributions can be divided
properly.

A third embodiment of the present invention will next be described.
In the third embodiment, the threshold t for the division into slices takes 0.5
when the responses are binary, either 0 or 1. The structure of the third
embodiment is the same as that of the first embodiment. A different point is
that, while the median is used as the threshold t for the division into slices in
the operation of the first embodiment (step A2 in Fig. 4), 0.5 is used as the
threshold t in the operation of the third embodiment.

The effect of this embodiment will be described below. When the
response variables are binary, either 0 or 1, the use of the median for the
division into slices in the first embodiment results in slice division at a
threshold of 0 or 1. On the other hand, since 0.5 is used for the division into
slices in this embodiment, the response variables can be divided into a slice
for 0s and a slice for 1s.
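The effect of the fixed threshold can be checked on a small example. The five binary responses below are made up for illustration, and the helper `slice_sizes` is a hypothetical name; the split uses the condition y > t for the upper slice.

```python
import numpy as np

y = np.array([0, 1, 1, 1, 1])        # binary responses, mostly 1s

def slice_sizes(y, t):
    """Return (N_H, N_L) for the split y > t versus the rest."""
    n_high = int(np.sum(y > t))
    return n_high, len(y) - n_high

median_split = slice_sizes(y, np.median(y))   # the median equals 1 here
fixed_split = slice_sizes(y, 0.5)             # threshold of this embodiment
```

With the median as threshold the upper slice is empty in this example, so no mean vector can be computed for it, whereas the threshold 0.5 separates the four 1s from the single 0.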

A fourth embodiment of the present invention will next be 
described. The fourth embodiment is to cope with missing values. The 
structure of the fourth embodiment is the same as that of the first 
embodiment. A point different from the operation of the first embodiment is 
that when data are standardized (step A1 in Fig. 4), divided into slices (step 
A2), and the mean vector is calculated for each slice (step A3), missing 
values are removed from these calculations in this embodiment. 

With respect to the effect of this embodiment, since only the
missing values are removed from the data to be analyzed, individual data
containing the missing values can be effectively used for analysis without
removing the individual data themselves.
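One way to realize this behavior is sketched below, assuming that missing values are encoded as NaN. The function name `two_slice_direction_nan` is hypothetical, and the handling of a missing response (the row joins neither slice) is an assumption of the sketch, since the text does not address it.

```python
import numpy as np

def two_slice_direction_nan(X, y, threshold=None):
    """Illustrative sketch of steps A1-A4 that skips NaN entries only."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Step A1: per-covariate mean and standard deviation over observed entries.
    mu = np.nanmean(X, axis=0)
    sd = np.nanstd(X, axis=0)
    Z = (X - mu) / sd                # missing entries remain NaN

    # Step A2: ASSUMPTION - a row with a missing response joins neither slice.
    t = np.nanmedian(y) if threshold is None else threshold
    high = y > t
    low = y <= t

    # Step A3: slice means over observed entries; a row with one missing
    # covariate still contributes its other covariates.
    m_high = np.nanmean(Z[high], axis=0)
    m_low = np.nanmean(Z[low], axis=0)

    # Step A4: scaled difference of the two mean vectors.
    return 0.5 * (m_high - m_low)
```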

Referring to Fig. 5, a fifth embodiment of the present invention will
next be described in detail. Like the first to fourth embodiments, the fifth
embodiment of the present invention includes the input device, the data
analyzer, and the output device. In addition, this embodiment also includes
a memory medium 4 with a data analyzing program on it. The memory
medium 4 may be either transportable or fixed. For example, it may be a
magnetic disk, semiconductor memory, CD-ROM, or any other memory
medium.

A computer program capable of executing this method may also
be stored in a storage device on a computer connected to a network so that
it can be transferred to a storage device on another computer through the
network. The medium providing the computer program executing this
algorithm can be distributed in the form of a medium readable on a variety of
computers, and should not be limited to a particular type of medium.

The data analyzing program is read from the memory medium 4
into a data analyzer 5 to control the operation of the data analyzer 5 to
perform the same processing on the data file inputted from the input device 1
as the data analyzer 2 does in the first to fourth embodiments.

The above-mentioned first embodiment will next be specifically
described with reference to simulation results. A simulation model used in
the embodiment is represented by the following equation (17):

y = Uex P (W £ < 17 > 

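where $\varepsilon \sim N(0, 0.05^2)$, $\eta_0$ and z are represented by the following
equation (18), and $\Omega(\rho)$ is determined according to the following equation
(19).

$$ \eta_0 = (1, \ldots, 0)', \quad z = (z^{(1)}, \ldots, z^{(6)}) \sim N(0, \Omega(\rho)) \qquad (18) $$

$$ \Omega(\rho) = \begin{pmatrix}
1 & \rho & 0 & 0 & 0 & 0 \\
\rho & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & -\rho & 0 & 0 \\
0 & 0 & -\rho & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix} \qquad (19) $$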

It is assumed here that $\eta_0$ is a true EDR direction, and N(0, 1)
represents a normal distribution with mean 0 and variance 1.
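The covariate part of this simulation model can be sketched as follows: the block builds the 6 x 6 correlation matrix with one positively correlated pair and one negatively correlated pair of covariates, and draws N = 50 covariate vectors from the corresponding normal distribution. The function name `omega` and the random seed are assumptions made for illustration.

```python
import numpy as np

def omega(rho):
    """6 x 6 correlation matrix: z1, z2 correlated by rho; z3, z4 by -rho."""
    m = np.eye(6)
    m[0, 1] = m[1, 0] = rho
    m[2, 3] = m[3, 2] = -rho
    return m

rng = np.random.default_rng(0)       # ASSUMPTION: arbitrary seed
z = rng.multivariate_normal(mean=np.zeros(6), cov=omega(0.8), size=50)
```

For |rho| < 1 the matrix is positive definite, so it is a valid covariance matrix for the normal distribution.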

Fig. 6 is a scatter plot of data (data to be analyzed) created by this
model. In Fig. 6, N = 50 and $\rho$ = 0.8, and the response variable y is plotted
against $\eta_0' z$. In other words, the score $\eta_0' z$ is plotted on the abscissa
and the response variable y is plotted on the ordinate. Here, $\eta_0' z$ is
called the score in the true EDR direction. The present invention is applied
to these data.

Fig. 7 is a scatter plot of $z^{(1)}$ and $z^{(2)}$ after the response variables
are divided into two slices (step A2 in Fig. 4) and the mean vector is
calculated for each slice (step A3). The marks "O" indicate the mean
vectors $m_H$ and $m_L$, where H and L represent whether the corresponding
response variables are larger or smaller than the median. In Fig. 7, only
$z^{(1)}$ and $z^{(2)}$ are shown from among the six-dimensional covariates z.

Fig. 8 is a scatter plot of response variables y versus scores $\hat{\eta}' z$
in the EDR direction $\hat{\eta}$ estimated from the difference between
the mean vectors (step A4), in which $\hat{\eta}' z$ is plotted on the abscissa and
the response variable is plotted on the ordinate.

Fig. 9 is a scatter plot of response variables y versus scores $\hat{\eta}_N' z$
in the EDR direction $\hat{\eta}_N$ corrected by the correlation matrix. In Fig. 9,
$\hat{\eta}_N' z$ is plotted on the abscissa and the response variable is plotted on
the ordinate. It is apparent from comparisons among Figs. 6, 8, and 9 that
the true EDR direction can be estimated using the present invention.

The following table 1 shows mean values and standard
deviations of correlation coefficients between scores in the true EDR
direction and scores in the estimated EDR direction, and mean values and
standard deviations of correlation coefficients between scores in the
estimated EDR direction and two-valued response variables (in each case for
N = 50, 100, 500 and $\rho$ = 0.0, 0.8 in 100,000 tries). Representing the
two-valued response variables by $\delta$, the following equation (20) is given:

$$ \delta = \begin{cases} 1, & y > t \\ -1, & y \le t \end{cases} \qquad (20) $$

Table 1 (standard deviations in parentheses)

          \rho = 0.0                                          \rho = 0.8
  N       Cor(\hat{\eta}'z, \eta_0'z)  Cor(\hat{\eta}'z, \delta)   Cor(\hat{\eta}'z, \eta_0'z)  Cor(\hat{\eta}'z, \delta)
  50      0.936 (0.039)                0.803 (0.034)               0.921 (0.032)                0.769 (0.039)
  100     0.967 (0.021)                0.799 (0.023)               0.935 (0.020)                0.762 (0.027)
  500     0.993 (0.004)                0.798 (0.010)               0.946 (0.007)                0.758 (0.012)

Here, the threshold t is the median of the response variables. Table 1
gives the mean values and standard deviations of the correlation coefficients
for each combination of N = 50, 100, 500 and $\rho$ = 0.0, 0.8 over 100,000
analytical tries. The table shows that the correlation coefficients
between scores in the true EDR direction and scores in the estimated EDR
direction are close to 1, and that their variation is small. It can be found
from these facts that the true EDR direction can be estimated using the
present invention.

The above table 1 also shows that the correlation coefficients
between scores in the estimated EDR direction and two-valued response
variables do not vary very much even as the number of samples increases.
It can be found from this fact that the EDR direction can be estimated
regardless of the number of data.
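A simulation of this kind can be sketched as follows. The sketch is not a reproduction of the reported experiment, and several of its ingredients are assumptions labeled in the code: the link function is taken to be y = exp(eta_0' z) + noise, the true direction is taken to be the first coordinate axis, and the number of tries is kept small so that the example runs quickly.

```python
import numpy as np

def omega(rho):
    """6 x 6 correlation matrix: z1, z2 correlated by rho; z3, z4 by -rho."""
    m = np.eye(6)
    m[0, 1] = m[1, 0] = rho
    m[2, 3] = m[3, 2] = -rho
    return m

def one_trial(rng, n, rho, eta0):
    z = rng.multivariate_normal(np.zeros(6), omega(rho), size=n)
    score0 = z @ eta0                                   # true scores
    # ASSUMPTION: monotone link exp(.) plus N(0, 0.05^2) noise.
    y = np.exp(score0) + rng.normal(0.0, 0.05, size=n)
    zs = (z - z.mean(axis=0)) / z.std(axis=0)           # standardize
    high = y > np.median(y)                             # two slices
    eta = 0.5 * (zs[high].mean(axis=0) - zs[~high].mean(axis=0))
    score = zs @ eta                                    # estimated scores
    delta = np.where(high, 1.0, -1.0)                   # two-valued response
    return (np.corrcoef(score, score0)[0, 1],
            np.corrcoef(score, delta)[0, 1])

def simulate(n, rho, trials=200, seed=0):
    rng = np.random.default_rng(seed)
    eta0 = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])    # ASSUMPTION
    res = np.array([one_trial(rng, n, rho, eta0) for _ in range(trials)])
    return res.mean(axis=0), res.std(axis=0)
```

Calling `simulate(500, 0.0)` returns the mean and standard deviation of the two correlation coefficients over the tries; because of the assumptions above, the values need not coincide with those of table 1.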

According to the present invention, the inverse matrix of the
variance-covariance matrix is not used to standardize data in a single index
model, so that the data can be standardized using only the average and
variance of the data, which makes it possible to standardize data with a
large number of variables.

Also, according to the present invention, the EDR direction when
the number of slices is two can be determined without carrying out the
principal component analysis. In other words, the EDR direction can be
determined just by calculating the difference between the mean vectors, and
this makes it possible to determine the EDR direction when the number of
slices is two in a single index model composed of a large number of
variables. The computing speed is improved as well.

For the above-mentioned reasons, the technique can be applied
to data with a large number of variables, such as data from a DNA chip for
gene expression analysis or a microarray. When it is applied to microarray
data, the response variable y represents forms of expression such as side
effects, and x represents the amount of expression of each gene obtained by
the microarray. With respect to the coefficients in the EDR direction
obtained, gene A with a large coefficient has a more significant impact on
the forms of expression than gene B with a small coefficient, that is, gene A
is more important than gene B. Thus, depending on the magnitude of the
coefficients, genes important to the forms of expression can be searched for.
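The search by coefficient magnitude can be sketched as a simple ranking. The function name `rank_genes`, the gene labels, and the coefficient values in the usage example are hypothetical and serve only as an illustration.

```python
import numpy as np

def rank_genes(direction, gene_names, top=10):
    """Return the `top` genes ordered by absolute coefficient, largest first."""
    direction = np.asarray(direction, dtype=float)
    order = np.argsort(-np.abs(direction))[:top]
    return [(gene_names[i], float(direction[i])) for i in order]

# Hypothetical usage with three made-up genes and coefficients.
ranking = rank_genes([0.1, -0.9, 0.4], ["A", "B", "C"], top=2)
```

The sign of each coefficient is kept in the output, so that the direction of the association can still be read off after ranking.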



